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Abstract 

We study the statistical properties of contact vectors, a construct to charac- 
terize a protein's structure. The contact vector of an iV-residue protein is a list 
of N integers rtj, representing the number of residues in contact with residue 
i. We study analytically (at mean-field level) and numerically the amount of 
structural information contained in a contact vector. Analytical calculations 
reveal that a large variance in the contact numbers reduces the degeneracy 
of the mapping between contact vectors and structures. Exact enumeration 
for lengths up to N = 16 on the three dimensional cubic lattice indicates 
that the growth rate of number of contact vectors as a function of N is only 
3% less than that for contact maps. In particular, for compact structures we 
present numerical evidence that, practically, each contact vector corresponds 
to only a handful of structures. We discuss how this information can be used 
for better structure prediction. 

I. Introduction 



The protein folding problem has been the subject of extensive research in the last decade 
and although much has been learned a satisfactory understanding of the phenomenon has 
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not been reached yet ■ The physical approach to the problem is to consider the na- 
tive state of a protein as the ground-state of a Hamiltonian which acts on sequence space 
and summarizes the inter-residue and residue-solvent interactions Recently it was 

shown that there are several cases for which there is no possible choice of pairwise contact 
interactions between residues that suffices to pin down the native state even for a single 
protein ]5]||. This conclusion is supported by molecular dynamics studies fjj and lattice 
models |§ on residue-solvent interactions, where many-body forces are shown, or can be 
deduced to be, as relevant as two-body forces. To get around this failure of the two-body 
Hamiltonian approach while retaining a coarse-grained description (as opposed to, say, an 
all-atom one, including water 0), we need to introduce new terms at the residue level, to 
bias the optimization procedure towards the true minima. 

It is widely accepted that hydrophobicity is the force driving the folding process jnj . At 
the individual residue level, hydrophobicity is correlated with the solvent-exposed surface 
area in the native state |TI[. In addition, as reported below, a statistical analysis of the 
native structures deposited in the Protein Data Bank (PDB) flT2"| ) reveals a good correlation 
(coefficient of correlation 0.8) between the solvent-hidden surface area per residue and the 
number of inter-residue contacts per residue in the native state. We therefore propose the 
following two-step procedure for predicting the native state of a protein. First, a reasonably 
accurate prediction of the exposed surface area in the native fold is made on the basis of 
sequence information Second, this information is translated into a prediction of the 

number of native contacts of each residue, e.g. to a predicted native contact vector. Even if 
this scheme will turn out to be insufficient to perform a successful prediction, it opens the 
possibility to confine the search for the native fold to a small portion of the conformational 
space. The question then becomes "How many folded configurations are there, consistent 
with a given set of contact numbers ?" And, for that matter, "Is such a contact-number 
representation of the protein structure degenerate at all ?" The rest of this paper addresses 
these questions. 
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II. Contact maps versus contact vectors 



The contact map (CM) [Tj| of a protein of N amino-acids is a symmetric binary matrix C 
of size N x N, such that Cjj = 1 when the i th and the j th amino-acids of the sequence are 
neighbors, with some suitable definition of "neighborhoodness" (e.g., a common construct is 
to threshold the pairwise distance matrix for the C a atoms ||). The CM has proven to be a 
convenient encoding of the 3-dimensional native fold: 1. The native backbone conformation 
can be reproduced to within ~ 1.5 A average uncertainty (the same as most X-ray data) 



IJ], 2. It allows for an efficient search of the configuration space, since large conformational 
changes can be obtained by minor modifications of the CM fll5| . Within such minimalistic 
framework one hopes to gain new insight to the protein-folding problem since it is amenable 
to different physical and mathematical tools. For instance, the following Hamiltonian acting 
on the contact map space has been extensively used in the past ]|16| -|19|1 : 



H = ^w(a h a j )C ij , (1) 

where w(a,/3) is one of the 210 energy parameters representing the contact energy between 
the amino-acid types a and (3. Unfortunately, this formulation has limited predictive power. 
For example, given a large enough set of sequences and decoys (obtained by threading) 
from the PDB, no set of w(a,/3) exists, for which H has its ground-states at the native- 
folds ||. This is in accordance with the recent studies on the nature of the hydrophobic 
interaction @||, whose conclusion is that many-body interactions are of the same order of 
magnitude as two-body interactions. 

One possible way to improve the Hamiltonian in Eq. (|l|) is to include an energy penalty 
for deviations from the native-contacts: 

H = (1 - A) ]T w(a h a,) C - + A - n™*) 2 (2) 

ij i 

where we define a "contact vector" (CV) n of rank N, which is the sum of the entries of the 
CM on each row (or column) (see Fig. ([I])) : 
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n, 



£<V (3) 



Contact vectors have already been studied in the context of protein folding |20-28[|. We 
note in particular that the second term in Eq. (fj) resembles a hydrophobic term introduced 
previously [^9| and studied in Ref . |30| , with the difference that there the desired number of 
contacts of residue % is determined by its species. Here instead we assume the knowledge of 
n nat^ correct number of contacts of residue i in the native structure. Hence the second 
term in Eq. (g) carries the same spirit as the Go model . In this work we are interested 
in studying the statistical properties of contact vectors. For our more general purpose, it 
would seem inconsistent to use Eq. (0) to predict the native structure of a protein, as we 
bias the Hamiltonian towards the minimum by using information which is not accessible to 
us before we actually solve the problem. However, unlike in the Go model, the information 
required here about the native state (the number of contacts for each residue) is modest, 
and, most crucially, can be predicted. Learning algorithms have been recently developed, 
which are trained on known structures to predict the surface exposure of the amino-acids 



in the native fold [11,32]. Since the hydrophobic effect is driving the folding process [10 



it is natural to expect that an accurate prediction of the solvent exposed surface of each 
residue in the folded state may lead to prediction of the correct native structure. To bridge 
the gap between the exposed-surface information and the CV defined above, we performed 
an analysis on a representative set of proteins from the PDB database. We found a linear 
correlation with a coefficient of correlation of 0.8 between the solvent hidden surface area 
of a residue and the number of amino-acids it is in contact with (see Fig.(§)). Therefore, 
in future work we expect to replace the rtf at term in Eq. (0) by n P redtcted ^ thereby breaking 
the causality loop which is a characteristic of Go-like models. Another reason to study the 
model of Eq. (fj) is that a related kind of Hamiltonian has been recently proved to be useful 
to determine the structure of nearly-native protein conformations |33|]. In that study, 
represent the number of native contacts formed by residue i in the contact map C. Also in 
that case, it was found that a large variance in rii (see below) implies a low degeneracy in 
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mapping between contact vectors and three-dimensional conformations. 

In studying Eq. first we tried to use a set of contact energy parameters w{a^ a,-), 
found earlier by an optimization process, using Eq. (0) with A = 0. This attempt failed to 
assign the minimal energy to the native state for any choice of A. However, an optimization 
of w(a,j3) over the known structures by using the Hamiltonian in Eq. with A 7^ may, 
perhaps, successfully identify the native state. We will investigate this possibility in the 
future. 

The Hamiltonian (0), with A = 1, fails to identify the native fold. This statement means 
that it is possible to find conformations which on the one hand are very different from the 
native one and, on the other, each amino-acid has exactly the correct number of neighbors, 
that is the same number of neighbors as in the native state. This result was first found by 
Ejtehadi et al. [2^] by exact enumeration of all the compact conformations on a 3 x 3 x 3 cubic 



lattice. For actual proteins, an example is given in Fig. [| in the case of protein CI2 (PDB 
code 2ci2), where the CMs of the native fold and of another conformation are superimposed. 
These two conformations have identical CVs. At first glance, it would seem unlikely to find 
two compact configurations where each residue has exactly the same number of neighboring 
residues in contact. On the other hand, the cautious reader will attribute this degeneracy 
to the loss of information (from N(N — l)/2 binary variables to N integers of size < N) 
associated with going from a given CM to its corresponding CV via Eq. (|3|). Quantifying 
the resulting degeneracy is a non-trivial problem. The next section is an analytical attempt 
in this direction. 

III. An analytical approach 

We ask the following question: "How many contact maps exist for a given contact vector 
n ?". In fact, we should be counting, for a given n, only the physical CM that are consistent 



with it. A physical CM p9| , |H| is one for which a perfectly matching chain configuration 
can be found. There is, however, no known analytical selection rule for the physical CMs 
among all symmetric and traceless N x N matrices; therefore in our analytic study we will 



consider all binary symmetric matrices. This is essentially the mean-field treatment of the 
problem, since in the limit of infinite dimensions, all the constraints on the CM, except being 
symmetric with zero trace, will be relaxed. For any finite dimension we overestimate the 
degeneracy - the number of physical CMs scales exponentially, as e , whereas the number 
of possible CMs scales as e N2 |[3] . 

The formal expression for the number of symmetric, traceless binary matrices consistent 
with a given vector, n, is 

i>j N 

£ lHv • (4) 

{Xij} 1 = 1 

The sum over x^ = 0,1 represents a trace over all binary matrices, and the constraint 
i > j ensures symmetry and zero trace. In order to perform the summation, we rewrite the 
Kronecker 5 discrete Fourier sum: 

N-l 



i>j N 

d(n) = E II 



1 h 

— E e i27r ^^ a!y_Tl ^ 



iV 

{ Cij } i=l L iv fc=0 

i N-l N-l N-l ( i>j \ 

= 4^E E •■■ E E^ E<fc<(E ^^ n ° • (5) 

fe 1= k 2 =0 k N =0 \{xij} ) 

Scaling fcj by N, approximate the sums by integrals. Then, evaluate the trace over the 

matrix elements, paying special attention to x^ = Xji and Xu = 0: 



d(n) 



f 1 dhdk 2 ...dk N ( Y e^XXEj**""*) ] 

2^- 1 )/ 2 | Q 1 dA; 1 dA ;2 ...^e- i27r ^^^ 1 )/ 2 -^ IJcos + %)] • (6) 



The integral can now be evaluated around its saddle points, fcj = 1/2 and fcj = 0,1, which 
contribute equally. After we set fc, = 1/2 + qi and assume iV is divisible by 4, we obtain 

-1/2 
'-1/2 

The last square term in the exponent can be eliminated by a Hubbard- transformation after 
rescaling ^ by yN and defining r/j = n-i — (N — l)/2: 

d(n) ~ 2 N ^ N -^ 2 M [ 1/2 d qi dq 2 ...dq N r dye-W+^Xi****^™-"* 3 ' 2 ^ , 

V 7T 7-1/2 J-oo 



2 iv( W -i)/2 2 ^ 1/2 dqidq2 ..j qNe -^EM( N -w-^-^m [NEiti+CEi*)*] . ( 7 ) 

7-1/2 



which finally simplifies to yield 



nN 2 /2 

d(n) ~ —e- 2 ^ 2 , (8) 



where fj and a v are the average and the standard deviation of rji = (n, — N/2). This is 
a mean-field estimate of how the degeneracy of a CV scales with respect to the statistical 
properties of the CV. The leading behavior is clearly far from being realistic, since the 
degeneracy should scale at most as z N for some z < zcm (ln(zcAf) is 0.83 in 2d |T3 and 1.32 



in 3d as calculated here). Eq. (§) further suggests that the maximally degenerate CV with 
a fixed average number of contacts has a v = 0, i.e., all the amino-acids have equal number 
of contacts, whereas an unbiased sample of CVs will be dominated by those vectors with a 
typical standard deviation of ~ y/N. The mean-field message is that the degeneracy is a 
decreasing function of a v , i.e., variation in contact number is desirable for low degeneracy. 
In the next section, we argue that this is true away from the saddle point as well. 

IV. Finite connectivity : Graph counting 

In the previous section, we allowed for the number of contacts to take any value between 
and N. In reality, and also in lattice models, the number of contacts is of order unity. 
Therefore, it is desirable to have an estimate of the degeneracy of such CVs. Once again, 
we consider all traceless, symmetric, binary N x N matrices. We first observe that every 
such matrix encodes a unique graph with N vertices, a vertex pair being connected if the 
corresponding matrix element is 1. Symmetry ensures that the graph is undirected. We can 
ensure chain connectivity (but not the graph being physical!) by freezing connections on the 
first off-diagonal; if we choose to relax these "backbone connections" , the remaining graph 
need not be connected. 

The degeneracy of a CV, can then be approximated by the number of graphs with N 
vertices and given connectivities. We imagine the vertices from 1 to N with corresponding 
number of legs sticking out of each and we ask in how many ways these legs can be connected 
such that none will be left out (the total number of legs is an even number). Eq. @ 



follows immediately if one imagines connecting pairs of legs sequentially (the numerator) and 
remembering that legs coming out of the same vertex are interchangeable (denominator). 

Let's assume we allow the entries of the CV to be one of 0, l,..,n, n <C N, and the 
composition given by {p ,pi, ..,p n }, piN = Ni being the number of amino acids with i 
contacts. The average number of contacts is J2i Wi = c - The corresponding number of 
graphs reads 

(cN - 1)" 

(The only difference with the usual Feynman diagram counting is the missing N\ in the 
denominator: our vertices are distinguishable since they correspond to the amino-acids 
labelled by their sequence number.) 

Note that this expression is an approximation to the number of symmetric traceless CMs, 
since diagrams with small loops involving one vertex, as well as with more than one line 
connecting the same two vertices are counted in Eq. (^|), even though they do not correspond 
to any CMs. However, corrections due to excluding such diagrams do not change the scaling 
with N. Applying Stirling's formula to Eq. (|9|), 



d(N, {pi}) ~ exp 



rN r 
— \nN + N(-lnc-l-^2p n \n(n\)) . (10) 



The leading order is now Zp lnN with zp = e c / 2 . Better estimations require taking into consid- 
eration the spatial correlations in the contact numbers due to the underlying one-dimensional 
chain. Our next task is to find the compositions with the minimum and maximum degen- 
eracy. The leading order in Eq. (|10"D depends only on the total number of contacts, so it is 
sensible to confine the search into the subspace of CVs with a fixed average connectivity. We 
then extremize the next order term with respect to {pi}, subject to the constraints YlPm — 1 
and mPm — c to find which distribution of contacts allows for the better "designability" 
(i.e., less degeneracy). Fig.(|J) shows the choice of {p{\ with maximum/minimum degener- 
acy obtained numerically, as a function of the average contact number, c (maximum number 
of non-backbone contacts, n, is chosen to be 4 as for the cubic lattice). As read from the 
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graphs, the highly degenerate scenario is when the number of contacts for each residue is 
minimally away from the average, and vice versa for the low degeneracy. Even though here 
we deal with low connectivity << N, whereas in the previous section we had rii ~ N, 
the result obtained here is the same as there - low degeneracy goes in parallel with maximal 
variation in contact number. 

One application of this principle is an order of magnitude estimation for the "optimal" 
length for a protein. Consider a necklace model of the protein, each residue represented 
by a sphere of fixed radius, and the necklace itself folded into a large compact sphere, 
where compactness is imposed as a necessary condition for stability. Then, maximal contact 
number fluctuation is attained when the number of buried residues equals the number of 
residues on the surface. From this purely geometric construction, one can estimate an "ideal" 
chain size: Let the radii of the individual residues be unity, and the radius of the protein 
be R. Assuming hep-like packing, each residue occupies a volume of v = 4v^2(unit) 3 , and 
those on the surface cover, roughly, a = 4(unit) 2 of surface area. Then, if N is the number 
of residues, we have vN = 4irR 3 /3 and aN/2 = AttR 2 , which yields N ~ 450. 

V. Numerical results 

To compare the analytical findings presented above with numerical simulations, we per- 
formed several exact enumeration studies on the square and the cubic lattices. Therefore, 
in this section, we deal with physical CMs and contact vectors, i.e., those generated by 
self-avoiding walks in two and three dimensions. In each analysis, we kept a record of the 
distinct CMs we encountered and the corresponding CVs. Our first observation is that, the 
number of distinct CVs for a given size N scales exponentially with iV : 

N cv ~ e a ™ N . 

For given N, the number of CVs, CMs, and self-avoiding walks increase in the given order. 
Yet, it is interesting that the growth rate a cv = 1.28 ± 0.01 is only about 3% less than the 
corresponding rate a cm = 1.32 ± 0.01 for the CMs in three dimensions (see Fig.(||), and 
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also ||13|| ). The discrepancy between the mean-field analytical calculations and the exact 
enumeration results points to the fact that the finite dimensionality and the correlations 
between contacts due to the underlying one-dimensional chain (i.e. working with physical 
CMs and CVs) are crucial. 

The almost identical growth rates is in accordance with our next analysis on the compact 
configurations on a 6 x 6 square lattice (see |34j|): Considering all the Hamiltonian walks 
inside a 6 x 6 square, we identified the number of walks that correspond to each CV and 
found that the number of CVs with degeneracy d drops more or less exponentially with d 
(see Fig.(P)). In fact, more than 96% of all the contact vectors have degeneracy d < 6, 
although it is possible to find a vector with 69 Hamiltonian walks mapped on it (not shown 
in Fig. (|6D ) • The degeneracy gets even smaller in the case of a compact but less than perfect 
packing , in our case when the 6x6 square is mostly filled with a 32-residue chain: introduced 
vacancies (especially when in the core) "label" some of the residues with otherwise identical 
contact number. Rearrangement of the core, where all the residues have identical number 
of contacts, is the dominant mechanism of degeneracy. Hence, it gets more difficult to find 
conformations with the same CV, once this degeneracy is lifted by the vacancies. In this 
case, for practically all the CVs we have d < 5. In three dimensions, for the system sizes 
within reach, practically all the residues are on the surface. Therefore, we have not extended 
this analysis to such case. 

VI. Conclusion 

Existing and future prediction methods for the accessible surface area of individual 
residues can be adopted to predict the number of native contacts of each amino acid of 
a given protein. This prediction can then be used for an efficient search of the native con- 
tact map (and the corresponding conformation) in a dramatically reduced configuration 
space. The prerequisite of such a program is to be able to identify different folds consistent 
with a given set of contact numbers for each residue. We investigated at the mean-field level 
the partition of the configuration space (or rather the contact map space) into degeneracy 
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classes labelled by the CVs. The average degeneracy predicted by the analytical calcula- 
tions disagrees with the numerical findings, indicating that the finite dimensionality and the 
correlations induced by the underlying one-dimensional chain are crucial even for a qualita- 
tively satisfactory result. We did find, already at the mean- field level, that the increasing 
the fluctuations in the native contact-numbers reduces the contact vectors' degeneracy. This 
finding is also supported by another analytical calculation, valid in a different regime, where 
the average contact number is 0(1). 

We further investigated by exact enumeration the degeneracy spectrum of CVs for self- 
avoiding walks on the square and the cubic lattice. We found that for compact self-avoiding 
walks the CM and the CV representations carry nearly the same amount of information. 
This is an encouraging result, for an accurate enough prediction of solvent exposed surface 
areas in the native state may then be used to reduce the search space sufficiently, so that 
within the limited set of remaining candidate CMs a simple pairwise interaction potential 
may suffice to single out the native fold of the protein. In addition, we performed exact 
enumeration over all SAWs of iV < 16 steps in three dimension, and found that the number 
of CVs grows exponentially with the protein length, with a prefactor only a few percent 
smaller than that for the CMs. The slow exponential growth of the average degeneracy 
of the CVs is largely overestimated by our mean-field calculations. Further analytical and 
numerical research is certainly called for. We also observed that for compact configurations, 
CV — > CM mapping is practically one-to-few. The Hamiltonian in Eq. ([!]), therefore, may 
still be promising if the pairwise interactions are optimized within the context of a (even 
roughly) predicted CV. 
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FIGURES 

FIG. 1. The contact map is a binary representation of the three-dimensional structure of the 
folded protein. The contact vector is constructed by summing up the rows of the contact map. 

FIG. 2. Hidden surface area per residue (after appropriate scaling) vs number of contacts per 
residue. The histogram is obtained by averaging over 177 representative proteins with a threshold 
on the C a distance between amino-acid pairs ||. The number of occurences is gray-scale coded, 
increasing from to 500. The coefficient of correlation is 0.8. 

FIG. 3. The native contact map (red) of protein CI2 (PDB code 2ci2) and a non-native 
map (blue) with the same contact vector are overlapped. For clarity, the symmetric half of the 
non-native contact map is omitted. 

FIG. 4. Solution of the mean-field equations for the maximal and minimal degeneracy of 
contact vectors with finite average contact number. Maximum contact number is chosen to be 4 as 
for the cubic lattice. The solution for each pi is drawn within the corresponding horizontal band. 
The y-axis of each band is labeled on the left and right alternatingly. For fixed average contact 
number, c, lowest degeneracy is when the standard deviation in the contact numbers is maximal, 
and vice versa. 

FIG. 5. Scaling of number of contact maps and vectors with chain length, obtained by exact 
enumeration on the three dimensional cubic lattice. The inset (also a log-plot) shows clearly that 
the deviation between the two growth rates is real. 

FIG. 6. Degeneracty of the contact vector on a 6 x 6 square. The upper graph is a log-plot 
of the number of distinct contact vectors with the degeneracy given on the x-axis. The lower 
graph shows the fraction of SAWs of size 32 and 36 inside the square, covered by the subset of 
corresponding contact vectors with degeneracy < x. 
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